DISTRIBUTED CLUSTERING METHOD AND SYSTEM

Field of the Invention

The present invention relates generally to data clustering and more
specifically to a method and system for distributed data clustering.

Background of the Invention 

There has been a general notion of performing data clustering in parallel on
more than one computer to increase the efficiency of the data clustering.
This is particularly important as data sets increase in the number of data points
that need to be clustered, and in the case of naturally distributed data (e.g., data
for an international company with many offices located in different countries and
locations). Unfortunately, the known data clustering techniques were
developed for execution by a single processing unit. Furthermore, although there
have been attempts to make these known data clustering techniques into parallel
techniques, as described in greater detail hereinbelow, these prior art approaches to
formulating a parallel data clustering technique offer only tolerable solutions, each
with its own disadvantages, and leave much to be desired.

One prior art approach proposes a parallel version of the K-Means clustering
algorithm. The publication entitled "Parallel Implementation of Vision Algorithms
on Workstation Clusters," by D. Judd, N. K. Ratha, P. K. McKinley, J. Weng, and A.
K. Jain, Proceedings of the 12th International Conference on Pattern Recognition,
Jerusalem, Israel, October 1994, describes parallel implementations of two computer
vision algorithms on distributed cluster platforms. The publication entitled
"Performance Evaluation of Large-Scale Parallel Clustering in NOW
Environments," by D. Judd, P. K. McKinley, and A. K. Jain, Proceedings of the
Eighth SIAM Conference on Parallel Processing for Scientific Computing,
Minneapolis, Minnesota, March 1997, further presents the results of a performance
study of parallel data clustering on Network of Workstations (NOW) platforms.

Unfortunately, these publications do not formalize the data clustering 
approach. Furthermore, the procedure for K-Means is described in a cursory fashion 
without explanation of how the procedure operates. Also, the publications are silent 
about whether the distributed clustering technique can be generalized, and if so, how 
the generalization can be performed, thereby limiting the applicability of the Judd
approach to K-Means data clustering. 

Another prior art approach proposes non-approximated, parallel versions of
K-Means. The publication "Parallel K-means Clustering Algorithm on NOWs," by
Sanpawat Kantabutra and Alva L. Couch, NECTEC Technical Journal, Vol. 1, No.
1, March 1999, describes an example of this approach. Unfortunately, the
Kantabutra and Couch algorithm requires re-broadcasting the entire data set to all
computers for each iteration. Consequently, this approach may lead to heavy
congestion in the network and may impose a communication overhead or penalty.
Since the trend in technology is for the speed of processors to improve faster than
the speed of networks, it is desirable for a distributed clustering method to reduce the
amount of data that needs to be communicated between the computers in the
network.

Furthermore, the number of slave computing units in this algorithm is limited
to the number of clusters to be found. Also, an analytical and empirical analysis of
this approach estimates a 50% utilization of the processors. It would be desirable to
have a distributed clustering method that achieves a greater utilization of the
processors.

Accordingly, there remains a need for a method and system for data 
clustering that can utilize more than one computing unit for concurrently processing 
the clustering task and that overcomes the disadvantages set forth previously.

SUMMARY OF THE INVENTION
According to one aspect of the present invention, a distributed data clustering 
method and system are provided for performing the data clustering in a parallel 
fashion instead of a sequential fashion. 

According to another aspect of the present invention, a distributed data 
clustering method and system are provided for utilizing a network of computing 
resources that are either homogenous computing resources or heterogeneous 
computing resources. 

According to yet another aspect of the present invention, a distributed data 
clustering method and system are provided for using a network of computers to 
efficiently process large data sets that need to be clustered. 

According to another aspect of the present invention, a distributed data 
clustering method and system are provided for using a network of computers to 
efficiently process non-distributed data that need to be clustered. 

According to yet another aspect of the present invention, a distributed data 
clustering method and system are provided for using a network of computers to 
efficiently process naturally distributed data that need to be clustered. 

The distributed data clustering system has an integrator and at least two
computing units. Each computing unit is loaded with common global parameter
values and a particular local data set. Each computing unit then generates local
sufficient statistics based on the local data set and the global parameter values. The
integrator employs the local sufficient statistics of all the computing units to update
the global parameter values.



Attorney Docket No. 10001360-1 

BRIEF DESCRIPTION OF THE DRAWINGS 
The present invention is illustrated by way of example, and not by way of
limitation, in the figures of the accompanying drawings, in which like reference
numerals refer to similar elements.

FIG. 1 illustrates a distributed data clustering system configured in
accordance with one embodiment of the present invention, where the distributed data 
clustering system has homogenous computing components. 

FIG. 2 illustrates a distributed data clustering system configured in 
accordance with an alternative embodiment of the present invention, where the 
distributed data clustering system has heterogeneous computing components. 

FIG. 3 is a block diagram illustrating in greater detail the integrator of FIG. 1 
or FIG. 2 in accordance with one embodiment of the present invention.

FIG. 4 is a block diagram illustrating in greater detail a computing unit of 
FIG. 1 or FIG. 2 in accordance with one embodiment of the present invention. 

FIG. 5 is a block diagram illustrating the information communicated between 
the integration program and computing program according to one embodiment of the 
present invention. 

FIG. 6 is a flowchart illustrating the processing steps performed by the 
distributed data clustering system according to one embodiment of the present 
invention.

FIG. 7 illustrates an exemplary timeline for a single iteration of the
distributed clustering method in a network without a broadcasting feature. 

FIG. 8 illustrates a timeline for a single iteration of the distributed clustering 
method in a network with a broadcasting feature. 

FIG. 9 illustrates exemplary data clustering methods that can be performed in 
the distributed manner as set forth by the present invention.

DETAILED DESCRIPTION
In the following description, for the purposes of explanation, numerous
specific details are set forth in order to provide a thorough understanding of the
present invention. It will be apparent, however, to one skilled in the art that the
present invention may be practiced without these specific details. In other instances,
well-known structures and devices are shown in block diagram form in order to
avoid unnecessarily obscuring the present invention. The following description and
the drawings are illustrative of the invention and are not to be construed as limiting
the invention.

Homogenous Distributed Data Clustering System 100

FIG. 1 illustrates a distributed data clustering system 100 configured in
accordance with one embodiment of the present invention, where the distributed data
clustering system has homogenous computing components. The system 100
includes an integrator 110 and at least two computing units 120 that communicate
via a network. In this example, there is a plurality of computing units 120 (e.g., L
computing units labeled CU_1, CU_2, ..., CU_L). The computing units 120 can be
disposed in a distributed fashion and be remote from the integrator 110 and from the
other computing units 120. It is noted that there can be a single computing unit 120
in a particular location, or a plurality of computing units 120 can be disposed in a
particular location. It is noted that the integrator 110 can be implemented by a
computing unit 120 that executes an integration program as described in greater
detail hereinafter.

It is noted that the computing units 120 in this system are homogenous in
nature (i.e., similar in processing capabilities, hardware configuration, etc.). In a
system having a plurality of homogenous computing units 120, the processing can be
divided evenly among the units 120. However, as described in greater detail
hereinafter, when the computing units 120 are not homogenous, there may be an
opportunity to allocate the processing task in a manner that takes into account the
computing power of each computing unit.

Heterogeneous Distributed Data Clustering System

FIG. 2 illustrates a distributed data clustering system 200 configured in 
accordance with an alternative embodiment of the present invention, where the 
distributed data clustering system has heterogeneous computing components. 

It is noted that the computing components (e.g., computing units 220 and
integrator 210) in the system are heterogeneous in nature. In this embodiment, at
least one of the computing components is different from the other
components. For example, the computing components can be different in processing
capabilities, hardware configuration, etc. In a distributed data clustering system
having heterogeneous computing components, it is preferable to assign a particular
processing task to a specific computing unit based on the requirements of the
processing task and the processing power of the computing unit. Specifically, the
processing power and capabilities of the computing units can be correlated to the
specific requirements of the computation tasks when assigning different tasks to
the different units.

For example, more powerful computing units can be assigned a greater
number of data points to process, and less powerful computing units can be
assigned fewer data points to process, so that all the computing units
complete execution of their respective assigned tasks at about the same time. By so
doing, the system can more efficiently manage the load of the clustering task and
reduce the amount of time spent waiting for partial results that are needed to generate
the final result.

Integrator 

FIG. 3 is a block diagram illustrating in greater detail the integrator 110 or 
210 of FIG. 1 or FIG. 2, respectively, in accordance with one embodiment of the 
present invention. The integrator 110 or 210 includes a processor 310 for executing
computer programs and a memory 320 for storing information and programs. The 
memory 320 includes data points 330 to be clustered and an integration program 
340. 

When executing on the processor 310, the integration program 340 provides
different data points and common parameters (e.g., global parameter values) to each
of the computing units, receives local information provided by the computing units
120, integrates the local information to generate global sufficient statistics, and
based thereon updates the global parameter values (GPVs) that are subsequently
provided to each of the computing units 120 for a next iteration. The distributed
clustering method is described in greater detail hereinafter with reference to FIG. 6.

The integration program 340 includes a data splitter (DS) 344 for splitting
the data points 330 into local data points that are provided to a particular computing
unit. The integration program 340 also includes a global sufficient statistics
determination unit (GSSDU) 348 for receiving the local sufficient statistics from the
computing units and based thereon generating global sufficient statistics. The
integration program 340 includes a global parameter value determination unit
(GPVDU) 350 for receiving the global sufficient statistics from the GSSDU and
based thereon generating the global parameter values (GPVs) that are subsequently
provided to each of the computing units in the next iteration. The computing unit is
described in greater detail hereinafter with reference to FIG. 4, and the information
communicated between the integration program 340 and the computing program 438
is described in greater detail hereinafter with reference to FIG. 5.

A communication interface (I/F) 324 is also provided for the integrator 110 
or 210 to communicate with the computing units through a network. 

Computing Unit 

FIG. 4 is a block diagram illustrating in greater detail a computing unit 120
or 220 of FIG. 1 or FIG. 2 in accordance with one embodiment of the present
invention. The computing unit 120 or 220 includes a processor 410 for executing
computer programs and a memory 420 for storing information and programs. The
memory 420 includes local data 430 and a computing program 438.

When executing on the processor 410, the computing program 438 receives
the global parameters (e.g., center locations) and the data points assigned to this
computing unit, and generates local information (e.g., local sufficient statistics)
based on these inputs.

The computing program 438 includes a local sufficient statistics 
determination unit (LSSDU) 440 for determining local sufficient statistics based on 
the inputs. The LSS are subsequently provided to the integrator for further 
processing. A communication interface (I/F) 424 is also provided for the computing 
unit 120 or 220 to communicate with the integrator through a network. 

FIG. 5 is a block diagram illustrating the information communicated between
the integration program and computing program according to one embodiment of the
present invention. The data splitter 344 receives the data points that need to be
clustered and partitions the data points into a plurality of subsets of data points (e.g.,
local data_1, local data_2, ..., local data_L). Each of these subsets of local data is
then provided to a designated computing unit as shown. It is also noted that each of
the computing units also receives common global parameter values (e.g., the initial
or updated center locations). Based on the GPV and the specific local data, each
computing unit generates local sufficient statistics (LSS) and provides the same to
the GSSDU 348.

The GSSDU 348 receives the local sufficient statistics (LSS), such as LSS_1,
LSS_2, ..., LSS_L, from each of the computing units (1 to L), respectively, and
generates global sufficient statistics (GSS) based thereon. The GPVDU 350 receives
the GSS and employs the GSS to update or revise the GPV, which is subsequently
provided to the computing units in the next iteration.

The distributed clustering technique of the present invention is especially
suited to handle naturally distributed data or data where the number of data points is
too large for a single processor or computing machine to handle. In the first case,
the data is naturally distributed. An example of naturally distributed data is customer
data, sales data, and inventory data for an international company with many offices
located in different countries and locations. In such an example, the
data splitter 344 is not needed and each computing unit can have access to its own
local data. In response to the local data and the GPV provided by GPVDU 350, each
computing unit generates local sufficient statistics to provide to GSSDU 348.

In the second case, the data is split into manageable tasks where each task
can be adequately processed by a computing unit. In this manner, a task that may
be too big for any one computing unit can be split into sub-tasks that can be
efficiently processed by each of the computing units.

Distributed Clustering Method 

FIG. 6 is a flowchart illustrating the distributed clustering method according
to one embodiment of the present invention. In step 600, a data set having a
plurality of data points is partitioned into at least two partitions or sub-sets of data
points. In this example, there are L partitions or sub-sets of data points
corresponding to L computing units.

By finding K centers (local high densities of data), $M = \{m_i \mid i = 1, \ldots, K\}$, the
data set, $S = \{x_i \mid i = 1, \ldots, N\}$, can be partitioned either into discrete clusters
using the Voronoi partition (each data item belongs to the center that it is closest to,
as in K-Means) or into fuzzy clusters given by the local density functions (as in
K-Harmonic Means or EM).

Step 600 can also include the step of loading the partitions into respective 
computing units and initializing the global parameter values (GPVs). It is noted that 
the step of splitting the data into partitions may be optional since the data may be
naturally distributed. 

In step 604, the GPVs are sent to all the computing units. In step 610, each 
computing unit determines local sufficient statistics (LSS) based on the received 
GPVs sent by the integrator and the specific partition or sub-set of data points loaded 
by the integrator. The LSS are then sent to the integrator.

As used herein the term "sufficient statistics" shall be defined as follows: 
quantities S can be called sufficient statistics of F, if the value of F can be uniquely 
calculated or determined from S without knowing any other information. In other 
words, S is sufficient for calculating F. For further information regarding sufficient 
statistics, please refer to Duda & Hart, "Pattern Classification and Scene Analysis,"
John Wiley & Sons, pages 59-73. 
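
As a minimal illustration of this definition (a sketch only, not part of the
described embodiment), the mean of a data set can be calculated from the count and
the sum of each partition without access to the individual data points. The names
`local_stats` and `combine` are hypothetical and chosen for this example.

```python
# Sketch: the pair (count, sum) is a sufficient statistic for the mean.
# local_stats and combine are hypothetical names used for illustration only.

def local_stats(points):
    """Return the (count, sum) pair for one partition of scalar data."""
    return len(points), sum(points)


def combine(stats):
    """Integrate per-partition (count, sum) pairs and return the global mean."""
    total_count = sum(count for count, _ in stats)
    total_sum = sum(partial for _, partial in stats)
    return total_sum / total_count


# Three partitions, as if held by three separate computing units.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
global_mean = combine([local_stats(p) for p in partitions])
```

Only two numbers per partition cross the network in this sketch, regardless of how
many data points each partition holds.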

In step 620, the integrator receives the LSS from each of the L computing
units and based thereon generates global sufficient statistics (GSS). The LSS from
the computing units are "integrated" to generate GSS. In step 630, the GSS are
utilized to generate updated global parameter values (GPVs).

In step 640, the integrator checks the convergence quality of the resulting
center locations from the current iteration. In step 650, a determination is made
whether the convergence quality meets a predetermined standard. If the convergence
quality meets the predetermined standard, processing proceeds to step 660 where the
computing units 120 are instructed to stop processing and the center locations are
provided as outputs. If the convergence quality does not meet the predetermined
standard, processing proceeds to step 604 where the updated GPVs are provided to the
computing units 120 in preparation for a next iteration.

Timeline for Network without a Broadcast Feature 

FIG. 7 illustrates an exemplary timeline 700 for a single iteration of the
distributed clustering method in a network without a broadcasting feature. The
horizontal axis 710 represents time, and the vertical axis 720 represents a plurality of
computing units. Each iteration can be divided into four phases. In phase A, data
set re-partitioning and local sufficient statistics collection are performed. In phase B,
the local sufficient statistics are sent across the network to the integrator. In phase C,
the integrator performs integration of the LSSs and generation of updated GPVs. In
phase D, the integrator propagates the updated GPVs to each of the computing units.

The first computing unit 740 is assigned the duties of the integrator.
Consequently, the first computing unit 740 is provided with a lighter load of data
(e.g., a partition or sub-set of data having fewer data points). The slope of the line
indicated in phase B is generally different from the slope of the line indicated in
phase D since the amount of data that is sent by the integrator to the computing units
is not the same as the data that is sent by the computing units to the integrator.

Timeline for Network with Broadcast Feature 

FIG. 8 illustrates a timeline 800 for a single iteration of the distributed
clustering method in a network with a broadcasting feature. The horizontal axis 810
represents time, and the vertical axis 820 represents a plurality of computing units
830. Each iteration can be divided into four phases. In phase A, data set
re-partitioning and local sufficient statistics collection can be performed. In phase B,
the local sufficient statistics are sent across the network to the integrator. In phase C,
the integrator performs integration of the LSSs and generation of updated GPVs. In
phase D, the integrator broadcasts the updated GPVs to each of the computing units.
The broadcasting feature allows all the computing units to begin phase A at the same
time without having a staggered start time as illustrated in FIG. 7.

For example, when the network of computers is Ethernet, the broadcasts can
employ the broadcasting features of Ethernet. First, at a low level, there is a
broadcast protocol for Ethernet. Second, Parallel Virtual Machine (PVM) and
Message Passing Interface (MPI) (and their various implementations) typically have
routines to broadcast messages that can be utilized. Third, shared memory protocols
also support broadcasting. For example, a network interface card on each host may
contain some memory, whose contents are replicated across all cards in the ring of
hosts. In this regard, writing into the memory on one card replicates data to all other
cards' memories.

Exemplary Data Clustering Systems

FIG. 9 illustrates a data clustering system 900 having exemplary data
clustering systems that are implemented in a distributed manner according to the
present invention. The system 900 can include a distributed expectation
maximization clustering system (DEMCS) 910 for implementing an EM clustering
algorithm in a distributed manner. The DEMCS 910 includes one or more EM
computing units 918 and an EM integrator 914 for providing weighting parameters
and a co-variance matrix as global parameter values to the EM computing units 918.
For further details regarding the EM clustering algorithm, please refer to Dempster,
A. P., Laird, N. M., and Rubin, D. B., "Maximum Likelihood from Incomplete Data
via the EM Algorithm", J. of the Royal Stat. Soc., Series B, 39(1):1-38, 1977, and
McLachlan, G. J., Krishnan, T., "The EM Algorithm and Extensions", John Wiley &
Sons, 1996.

The system 900 can include a distributed K-Means clustering system
(DKMCS) 920 for implementing a KM clustering algorithm in a distributed manner.
The DKMCS 920 includes one or more KM computing units 928 and a KM
integrator 924 for providing center locations as global parameter values to the KM
computing units 928. For further details regarding the K-Means clustering algorithm,
please refer to "Some methods for classification and analysis of multivariate
observations," by J. MacQueen, Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability, Vol. 1., pages 281-297, University of
California Press, Berkeley XVII + 666 p, 1967.

The system 900 can include a distributed K-Harmonic Means clustering
system (DKHMCS) 930 for implementing a KHM clustering algorithm in a
distributed manner. The DKHMCS 930 includes one or more KHM computing
units 938 and a KHM integrator 934 for providing center locations as global
parameter values to the KHM computing units 938. For further details regarding the
K-Harmonic Means clustering algorithm, please refer to "K-Harmonic Means - A
Data Clustering Algorithm" by Zhang, B., Hsu, M., Dayal, U., Hewlett-Packard
Research Laboratory Technical Report HPL-1999-124.

It is noted that other data clustering systems can be implemented in the
distributed manner as set forth by the present invention. The distributed data
clustering systems 910, 920, 930 implement center-based data clustering algorithms
in a distributed fashion. However, these center-based clustering algorithms, such as
K-Means, K-Harmonic Means and EM, have been employed to illustrate the parallel
algorithm for iterative parameter estimations of the present invention. In these
examples, the center locations are the parameters, which are estimated through many
iterations. However, the distributed iterative parameter estimation techniques of the
present invention are not limited to application to clustering algorithms.

A Class of Center-Based Algorithms

Let $\mathbb{R}^{dim}$ be the Euclidean space of dimension $dim$; $S \subset \mathbb{R}^{dim}$ be a finite
subset of data of size $|S|$; and $M = \{m_k \mid k = 1, \ldots, K\}$, the set of parameters to
be optimized. For example, the parameter set M consists of K centroids for
K-Means, K centers for K-Harmonic Means, and K centers with co-variance
matrices and mixing probabilities for EM. The performance function and the
parameter optimization step for this class of algorithms can be written in terms
of sufficient statistics (SS). The performance function is decomposed as follows:

Performance Function:

$$Perf(S, M) = f_0\Big(\sum_{x \in S} f_1(x, M),\ \sum_{x \in S} f_2(x, M),\ \ldots,\ \sum_{x \in S} f_R(x, M)\Big). \qquad (1)$$

What is essential here is that $f_0$ depends only on the SS, represented by
the sums, whereas the remaining $f_i$ functions can be computed independently for
each data point. The detailed form of $f_i$, $i = 1, \ldots, R$, depends on the particular
performance function being considered, which is described in greater detail
hereinafter with examples of K-Means, K-Harmonic Means and EM.

The center-based algorithm, which minimizes the value of the
performance function over $M$, is written as an iterative algorithm in the form of
$Q$ SS ($I()$ stands for the iterative algorithm, and $\sum_{x \in S} g_j$, $j = 1, \ldots, Q$, stands
for the SS):

$$M^{(u+1)} = I\Big(\sum_{x \in S} g_1(x, M^{(u)}),\ \sum_{x \in S} g_2(x, M^{(u)}),\ \ldots,\ \sum_{x \in S} g_Q(x, M^{(u)})\Big). \qquad (2)$$

$M^{(u)}$ is the parameter vector after the $u$-th iteration. We are only interested in
algorithms that converge: $M^{(u)} \to M$. The values of the parameters for the 0th
iteration, $M^{(0)}$, are set by initialization. One method often used is to randomly
initialize the parameters (e.g., centers, co-variance matrices and/or mixing
probabilities). There are many different ways of initializing the parameters for
particular types of center-based algorithms. The set of quantities

$$Suff = \Big\{\sum_{x \in S} f_r(x, M) \ \Big|\ r = 1, \ldots, R\Big\} \cup \Big\{\sum_{x \in S} g_q(x, M) \ \Big|\ q = 1, \ldots, Q\Big\} \qquad (3)$$

is called the global SS of the problem (1)+(2). As long as these quantities are
available, the performance function and the new parameter values can be
calculated, and the algorithm can be carried out to the next iteration. As
described in greater detail hereinafter, K-Means, K-Harmonic Means and
Expectation-Maximization clustering algorithms all belong to the class defined in
(1)+(2).

Parallelism of Center-Based Algorithms 

This decomposition of center-based algorithms (and many other iterative
parameter estimation algorithms) leads to a natural parallel structure with minimal
need for communication. Let $L$ be the number of computing units, each of which has
a CPU and local memory (e.g., personal computers (PCs), workstations or
multi-processor computers). To utilize all $L$ units for the calculation of (1)+(2), the
data set is partitioned into $L$ subsets, $S = D_1 \cup D_2 \cup \cdots \cup D_L$, and the $l$-th subset,
$D_l$, resides on the $l$-th unit. It is important not to confuse this partition with the
clustering. This partition is arbitrary and has nothing to do with the clustering in the
data. This partition is static. Data points in $D_l$, after being loaded into the memory
of the $l$-th computing unit, need not be moved from one computer to another except
perhaps for the purpose of load balancing among units, whose only effect is on the
execution time of the processors (i.e., load balancing does not affect the algorithm).

The processing units can be homogeneous processing units or heterogeneous
processing units. The sizes of the partitions, $|D_l|$, besides being constrained by the
storage of the individual units, are preferably set to be proportional to the speed of
the computing units. Such a partition ensures that each unit takes about the same
amount of time to finish its computation on each iteration, thereby improving the
utilization of all the units. A scaled-down test may be carried out on each computing
unit in advance to measure the actual speed (excluding the time of loading data,
because the data is loaded only once at the beginning), or a load balancing over all
units could be done at the end of the first iteration.
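
One way to derive partition sizes from measured speeds is sketched below. This is
an illustrative assumption rather than the method of the described embodiment: the
function name `partition_sizes` and the rule for distributing rounding leftovers are
hypothetical, and the storage constraint of each unit is ignored.

```python
# Sketch: choose partition sizes proportional to measured unit speeds.
# partition_sizes is a hypothetical helper; per-unit storage limits are ignored.

def partition_sizes(n_points, speeds):
    """Split n_points into len(speeds) sizes roughly proportional to speeds."""
    total = sum(speeds)
    # Round each proportional share down to a whole number of data points.
    sizes = [int(n_points * s / total) for s in speeds]
    # Hand out the points lost to rounding, fastest units first.
    remainder = n_points - sum(sizes)
    order = sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True)
    for i in order[:remainder]:
        sizes[i] += 1
    return sizes


# Example: the second and third units are twice as fast as the first.
sizes = partition_sizes(1000, [1.0, 2.0, 2.0])
```

The speeds passed in would come from the scaled-down test mentioned above, so the
units are expected to finish each iteration at about the same time.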

The calculation is carried out on all $L$ units in parallel. Each subset, $D_l$,
contributes to the refinement of the parameters in $M$ in exactly the same way as if the
algorithm were run on a single computer. Each unit independently
computes its partial sum of the SS over its data partition. The SS of the $l$-th partition
are

$$Suff_l = \Big\{\sum_{x \in D_l} f_r(x, M) \ \Big|\ r = 1, \ldots, R\Big\} \cup \Big\{\sum_{x \in D_l} g_q(x, M) \ \Big|\ q = 1, \ldots, Q\Big\}. \qquad (4)$$

One of the computing units is chosen to be the Integrator. The Integrator is
responsible for summing up the SS from all partitions (4), obtaining the global SS
(3); calculating the new parameter values, M, from the global SS; evaluating the
performance function on the new parameter values, (2); checking the stopping
conditions; and informing all units to stop or sending the new parameters to all
computing units to start the next iteration. The duties of the Integrator may be
assigned as a part-time job to one of the regular units. There may also be more than
one computer used as Integrators. The multiple Integrators can be organized in a
hierarchy if the degree of parallelism is sufficiently high. Special networking support
is also an option. If broadcast is supported efficiently, it may be effective to have
every node be an Integrator, thereby eliminating one direction of communication.

The Parallel Clustering Algorithm: 

Step 0: Initialization: Partition the data set and load the $l$-th partition into the
memory of the $l$-th computing unit. Use any preferred algorithm to
initialize the parameters, $\{m_k\}$, on the Integrator.

Step 1: Broadcast the integrated parameter values to all computing units.

Step 2: Compute at each unit independently the SS of the local data based on (4).

Step 3: Send the SS from all units to the Integrator.

Step 4: Sum up the SS from each unit to get the global SS, calculate the new
parameter values based on the global SS, and evaluate the performance
function. If the stopping condition is not met, go to Step 1 for the next
iteration; otherwise inform all computing units to stop. The stopping condition
typically tests for sufficient convergence or the number of iterations.
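
The steps above can be sketched, for illustration only, as a single-process simulation in which each list entry plays the role of one computing unit. The names `run_parallel_clustering`, `local_ss` and `integrate` are hypothetical, and the stopping test on the change of the performance value is one assumed choice among many. The toy functions `mean_local_ss` and `mean_integrate` estimate a single center (the mean) and are included only so that the sketch is runnable.

```python
import numpy as np

def run_parallel_clustering(partitions, params, local_ss, integrate,
                            max_iter=100, tol=1e-9):
    """Simulate Steps 1-4; units are visited in turn rather than concurrently."""
    prev_perf = None
    for iteration in range(1, max_iter + 1):
        # Steps 1-2: every unit receives `params` and computes SS on its own data.
        partial = [local_ss(part, params) for part in partitions]
        # Steps 3-4: the Integrator sums the SS component by component.
        global_ss = [sum(component) for component in zip(*partial)]
        params, perf = integrate(global_ss)
        if prev_perf is not None and abs(prev_perf - perf) <= tol:
            break  # assumed stopping condition: performance no longer changes
        prev_perf = perf
    return params, perf, iteration

def mean_local_ss(part, params):
    """Toy SS for one center: count, sum and sum of squares of the local data."""
    part = np.asarray(part, dtype=float)
    return (float(part.size), float(part.sum()), float((part ** 2).sum()))

def mean_integrate(global_ss):
    """Toy Integrator: new center and within-cluster variance from the global SS."""
    n, first, second = global_ss
    return first / n, second - first ** 2 / n
```

In an actual deployment the list comprehension over `partitions` would be replaced by message passing between the units and the Integrator.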


Examples 

The application of the techniques of the present invention to three 
exemplary clustering algorithms: K-Means, K-Harmonic Means, and EM, is 
now described. 


K-Means Clustering Algorithm 

K-Means is one of the most popular clustering algorithms. The algorithm
partitions the data set into K clusters, $S = (S_1, \ldots, S_K)$, by putting each data
point into the cluster represented by the center nearest to the data point. The
K-Means algorithm finds a locally optimal set of centers that minimizes the total
within-cluster variance, which is the K-Means performance function:

$$\mathrm{Perf}_{KM}(X, M) = \sum_{k=1}^{K} \sum_{x \in S_k} \|x - m_k\|^2, \tag{5}$$

where the center, $m_k$, is the centroid of the partition. The double
summation in (5) can instead be expressed as a single summation over all data
points, adding only the distance to the nearest center expressed by the MIN
function below:

$$\mathrm{Perf}_{KM}(X, M) = \sum_{i=1}^{N} \mathrm{MIN}\big\{ \|x_i - m_l\|^2 \;\big|\; l = 1, \ldots, K \big\}. \tag{6}$$
(=1 

The K-Means algorithm starts with an initial set of centers and then iterates
through the following steps:

1. For each data item, find the closest $m_k$, and assign the data item to the
$k$-th cluster. The current $m_k$'s are not updated until the next phase (Step 2).

2. Recalculate all the centers. The $k$-th center becomes the centroid of the $k$-th
cluster. This phase gives the optimal center locations for the given partition
of data.

3. Iterate through 1 & 2 until the clusters no longer change significantly.

After each phase, the performance value never increases and the
algorithm converges to a local optimum. More precisely, the algorithm reaches a
stable partition in a finite number of steps for finite datasets. The cost per
iteration is $O(K \cdot \mathrm{dim} \cdot N)$.

The functions for calculating both global and local SS for K-Means are
the 0th, 1st and 2nd moments of the data on the unit belonging to each cluster, as
shown in (7). Both the K-Means performance function and the new center
locations can be calculated from these three moments.

$$\begin{aligned}
g_1(x, M) &= f_1(x, M) = (\delta_1(x), \delta_2(x), \ldots, \delta_K(x)), \\
g_2(x, M) &= f_2(x, M) = f_1(x, M) \cdot x, \\
g_3(x, M) &= f_3(x, M) = f_1(x, M) \cdot x^2,
\end{aligned} \tag{7}$$

where $\delta_k(x) = 1$ if $x$ is closest to $m_k$, otherwise $\delta_k(x) = 0$ (resolve ties
arbitrarily). The summation of these functions over a data set (see (3) and (4))
residing on the $l$-th unit gives the count, $n_{k,l}$, first moment, $\Sigma_{k,l}$, and the
second moment, $s_{k,l}$, of the clusters. The vector
$\{n_{k,l}, \Sigma_{k,l}, s_{k,l} \mid k = 1, \ldots, K\}$ has
dimensionality $2 \cdot K + K \cdot \mathrm{dim}$, which is the size of the SS that have to be
communicated between the Integrator and each computing unit.

The set of SS presented here is more than sufficient for the simple
version of the K-Means algorithm. The aggregated quantity, $\sum_k s_{k,l}$, could be
sent instead of the individual $s_{k,l}$. But there are other variations of K-Means
performance functions that require the individual $s_{k,l}$ for evaluating the
performance functions. Besides, the quantities that dominate the communication
cost are the first moments, $\Sigma_{k,l}$.

The computing unit collects the SS, $\{n_{k,l}, \Sigma_{k,l}, s_{k,l} \mid k = 1, \ldots, K\}$, on
the data in its own memory, and then sends them to the Integrator. The
Integrator simply adds up the SS from each unit to get the global SS,

$$n_k = \sum_{l=1}^{L} n_{k,l}, \qquad \Sigma_k = \sum_{l=1}^{L} \Sigma_{k,l}, \qquad s_k = \sum_{l=1}^{L} s_{k,l}.$$

The leading cost of integration is $O(K \cdot \mathrm{dim} \cdot L)$, where L is the number of
computing units. The new location of the $k$-th center is given by $m_k = \Sigma_k / n_k$ from
the global SS (this is the I( ) function in (2)), which is the only information all the
computing units need to start the next iteration. The performance function is
calculated by (proof by direct verification)

$$\mathrm{Perf}_{KM} = \sum_{k=1}^{K} \Big( s_k - \frac{\|\Sigma_k\|^2}{n_k} \Big).$$

The parallel version of the K-Means algorithm gives exactly the same
result as the original centralized K-Means because both the parallel version and
the sequential version are based on the same global SS; they differ only in how the
global SS are collected.
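
As a minimal sketch of the K-Means case, assuming NumPy arrays with one data point per row, the local SS of (7) and the Integrator's update might be written as follows. The function names are hypothetical. Ties are resolved toward the lowest cluster index, and an empty cluster keeps its previous center; both are assumptions of this sketch rather than requirements of the method.

```python
import numpy as np

def kmeans_local_ss(data, centers):
    """Per-unit SS: counts n, first moments (K x dim), second moments s."""
    K = centers.shape[0]
    # Squared distance from every local point to every center (N x K).
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)            # ties go to the lowest index
    delta = np.eye(K)[nearest]             # N x K membership indicators
    n = delta.sum(axis=0)                  # 0th moment per cluster
    first = delta.T @ data                 # 1st moment per cluster
    second = delta.T @ (data ** 2).sum(axis=1)  # sum of squared norms
    return n, first, second

def kmeans_integrate(partials, old_centers):
    """Integrator: add the SS of all units, then derive centers and performance."""
    n = sum(p[0] for p in partials)
    first = sum(p[1] for p in partials)
    second = sum(p[2] for p in partials)
    centers = np.array(old_centers, dtype=float)
    nonempty = n > 0                       # assumed: empty clusters stay put
    centers[nonempty] = first[nonempty] / n[nonempty, None]
    perf = float((second[nonempty]
                  - (first[nonempty] ** 2).sum(axis=1) / n[nonempty]).sum())
    return centers, perf
```

Running `kmeans_integrate` on the SS of several partitions or on the SS of the undivided data set yields the same centers, which is the equivalence stated above.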

K-Harmonic Means Clustering Algorithm 

K-Harmonic Means is a clustering algorithm that features an insensitivity 
to the initialization of the centers. In contrast to the K-Means clustering 
algorithm whose results depend on finding good initializations, K-Harmonic 
Means provides good results that are not dependent on finding good 
initializations. 

The iteration step of the K-Harmonic Means algorithm adjusts the new 
center locations to be a weighted average of all x, where the weights are given by 



(K-Means is similar, except its weights are the nearest-center
membership functions, making its centers centroids of the cluster.) Overall then,
the recursion equation is given by



where $d_{x,k} = \|x - m_k\|$ and $s$ is a constant $\approx 4$. The decomposed functions
for calculating SS (see (3) and (4)) are then




(8)
(10)



Each computing unit collects the SS, 



(11) 



on the data in its own memory, and then sends it to the Integrator. The
size of the SS vector is $K + K \cdot \mathrm{dim}$ ($g_3$ is a matrix). The Integrator simply adds up
the SS from each unit to get the global SS. The new centers are given by the
component-wise quotient:



which is the only information the units need to start the next iteration.
This calculation costs $O(K \cdot \mathrm{dim} \cdot L)$. The updated global centers are sent to each
unit for the next iteration. If broadcasting is an option, this is the total cost in
time. If the Integrator finds that the centers have stopped moving significantly, the
clustering is considered to have converged to an optimum, and the units are stopped.
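
For illustration only, the following sketch shows how per-unit sums can be combined into new centers by a component-wise quotient. It assumes the weight commonly published for K-Harmonic Means, $1 / (d_{x,k}^{s} (\sum_l 1/d_{x,l}^2)^2)$, with $s = 4$; this weight, the function names, and the clipping constant `eps` that guards against zero distances are all assumptions of the sketch.

```python
import numpy as np

def khm_local_ss(data, centers, s=4.0, eps=1e-12):
    """Per-unit sums: weights per center (K) and weighted points (K x dim)."""
    d = np.sqrt(((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    d = np.maximum(d, eps)                       # assumed guard against d = 0
    inv_sq_sum = (1.0 / d ** 2).sum(axis=1)      # sum over centers of 1/d^2
    w = 1.0 / (d ** s * inv_sq_sum[:, None] ** 2)  # N x K weights (assumed form)
    denom = w.sum(axis=0)                        # K scalars
    numer = w.T @ data                           # K x dim matrix
    return denom, numer

def khm_integrate(partials):
    """Integrator: add the SS of all units and take the component-wise quotient."""
    denom = sum(p[0] for p in partials)
    numer = sum(p[1] for p in partials)
    return numer / denom[:, None]
```

The two returned arrays hold $K + K \cdot \mathrm{dim}$ scalars per unit, matching the SS size stated above.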

Expectation-Maximization (EM) Clustering Algorithm 

In this example, the EM algorithm with linear mixing of K bell-shape 
(Gaussian) functions is described. Unlike K-Means and K-Harmonic Means in 
which only the centers are to be estimated, the EM algorithm estimates the 
centers, the covariance matrices, $\Sigma_k$, and the mixing probabilities, $p(m_k)$. The
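performance function of the EM algorithm is

$$\mathrm{Perf}_{EM}(X, M, \Sigma, p) = -\log \Big\{ \prod_{x \in X} \Big[ \sum_{k=1}^{K} p_k \cdot \frac{1}{\sqrt{(2\pi)^{\mathrm{dim}} \det(\Sigma_k)}} \cdot \mathrm{EXP}\big( -(x - m_k) \Sigma_k^{-1} (x - m_k)^T \big) \Big] \Big\}, \tag{13}$$

where the vector $p = (p_1, p_2, \ldots, p_K)$ is the mixing probability. The EM
algorithm is a recursive algorithm with the following two steps: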
E-Step: Estimating "the percentage of x belonging to the $k$-th cluster",

$$p(m_k \mid x) = \frac{p(x \mid m_k) \cdot p(m_k)}{\sum_{l=1}^{K} p(x \mid m_l) \cdot p(m_l)}, \tag{14}$$

where $p(x \mid m_k)$ is the prior probability with Gaussian distribution, and
$p(m_k)$ is the mixing probability,

$$p(x \mid m_k) = \frac{1}{\sqrt{(2\pi)^{\mathrm{dim}} \det(\Sigma_k)}} \cdot \mathrm{EXP}\big( -(x - m_k) \Sigma_k^{-1} (x - m_k)^T \big). \tag{15}$$

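M-Step: With the fuzzy membership function from the E-Step, find the
new center locations, new covariance matrices, and new mixing probabilities
that maximize the performance function.

$$m_k = \frac{\sum_{x \in X} p(m_k \mid x) \cdot x}{\sum_{x \in X} p(m_k \mid x)}, \tag{16}$$

$$\Sigma_k = \frac{\sum_{x \in X} p(m_k \mid x) \cdot (x - m_k)^T (x - m_k)}{\sum_{x \in X} p(m_k \mid x)}, \tag{17}$$

$$p(m_k) = \frac{1}{N} \sum_{x \in X} p(m_k \mid x). \tag{18}$$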

The functions for calculating the SS are:

$$f_1(x, M, \Sigma, p) = -\log \Big[ \sum_{k=1}^{K} p(x \mid m_k) \cdot p(m_k) \Big],$$

$$g_1(x, M, \Sigma, p) = \big( p(m_1 \mid x), \, p(m_2 \mid x), \ldots, \, p(m_K \mid x) \big),$$

$$g_2(x, M, \Sigma, p) = \big( p(m_1 \mid x)\, x, \, p(m_2 \mid x)\, x, \ldots, \, p(m_K \mid x)\, x \big),$$

$$g_3(x, M, \Sigma, p) = \big( p(m_1 \mid x)\, x^T x, \, p(m_2 \mid x)\, x^T x, \ldots, \, p(m_K \mid x)\, x^T x \big).$$

The vector length (in number of scalars) of the SS is
$1 + K + K \cdot \mathrm{dim} + K \cdot \mathrm{dim}^2$.
The global SS is also the sum of the SS from all units. The performance function
value is given by the first global sufficient statistic. The global centers are from the
component-wise "ratio" of the third and the second global SS (see (16)), the
covariance matrices from (17) and the mixing probability from (18). All these
quantities, $\{m_k, \Sigma_k, p(m_k) \mid k = 1, \ldots, K\}$, have to be propagated to all the
units at the beginning of each iteration. The vector length is
$K + K \cdot \mathrm{dim} + K \cdot \mathrm{dim}^2$.
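
A minimal sketch of the EM case, assuming NumPy arrays with one data point per row, is given below for illustration. The function names are hypothetical. The sketch uses the conventional Gaussian density with a factor of 1/2 in the exponent, which is an assumption of the sketch, and it recovers the number of data points from the fact that the memberships of each point sum to one.

```python
import numpy as np

def em_local_ss(data, means, covs, mix):
    """Per-unit SS: f1 (scalar), g1 (K), g2 (K x dim), g3 (K x dim x dim)."""
    n_points, dim = data.shape
    K = means.shape[0]
    dens = np.empty((n_points, K))
    for k in range(K):
        diff = data - means[k]
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(covs[k]), diff)
        norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(covs[k]))
        dens[:, k] = mix[k] * np.exp(-0.5 * maha) / norm  # assumed 1/2 factor
    total = dens.sum(axis=1)
    post = dens / total[:, None]           # E-Step memberships p(m_k | x)
    f1 = float(-np.log(total).sum())
    g1 = post.sum(axis=0)
    g2 = post.T @ data
    g3 = np.einsum("nk,ni,nj->kij", post, data, data)
    return f1, g1, g2, g3

def em_integrate(partials):
    """Integrator: add the SS of all units, then apply the M-Step."""
    f1 = sum(p[0] for p in partials)
    g1 = sum(p[1] for p in partials)
    g2 = sum(p[2] for p in partials)
    g3 = sum(p[3] for p in partials)
    means = g2 / g1[:, None]
    covs = g3 / g1[:, None, None] - np.einsum("ki,kj->kij", means, means)
    mix = g1 / g1.sum()                    # g1.sum() equals the number of points
    return means, covs, mix, f1
```

The returned `f1` is the performance value for the parameters that were broadcast at the start of the iteration, since the SS are computed before the update.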

There are numerous applications that can utilize the distributed clustering
method and system of the present invention to cluster data. For example, these
applications include, but are not limited to, data mining applications, customer
segmentation applications, document categorization applications, scientific data
analysis applications, data compression applications, vector quantization
applications, and image processing applications.

The foregoing description has provided examples of the present invention. It
will be appreciated that various modifications and changes may be made thereto
without departing from the broader scope of the invention as set forth in the
appended claims. The distributed clustering method and system described
hereinabove is not limited to data clustering algorithms, but can, for example, be
applied to distributed parametric estimation applications (e.g., statistical estimation
algorithms that use sufficient statistics). Furthermore, the distributed parameter
estimation techniques of the present invention can be applied not only to large data
sets, but also to naturally distributed data (e.g., environmental data,
geological data, population data, governmental data on a global scale).
Applications to collect, process and analyze the naturally distributed data can
advantageously utilize the distributed parameter estimation techniques of the
present invention. These applications can include, for example, geological survey
applications, environment monitoring applications, corporate management
applications for an international company, and economic forecast and monitoring
applications.



